Computation and Language
☆ Rewarding Chatbots for Real-World Engagement with Millions of Users
Robert Irvine, Douglas Boubert, Vyas Raina, Adian Liusie, Vineet Mudupalli, Aliaksei Korshuk, Zongyi Liu, Fritz Cremer, Valentin Assassi, Christie-Carol Beauchamp, Xiaoding Lu, Thomas Rialan, William Beauchamp
The emergence of pretrained large language models has led to the deployment
of a range of social chatbots for chitchat. Although these chatbots demonstrate
language ability and fluency, they are not guaranteed to be engaging and can
struggle to retain users. This work investigates the development of social
chatbots that prioritize user engagement to enhance retention, specifically
examining the use of human feedback to efficiently develop highly engaging
chatbots. The proposed approach uses automatic pseudo-labels collected from
user interactions to train a reward model that can be used to reject
low-scoring sample responses generated by the chatbot model at inference time.
Intuitive evaluation metrics, such as mean conversation length (MCL), are
introduced as proxies to measure the level of engagement of deployed chatbots.
A/B testing on groups of 10,000 new daily chatbot users on the Chai Research
platform shows that this approach increases the MCL by up to 70%, which
translates to a more than 30% increase in user retention for a GPT-J 6B model.
Future work aims to use the reward model to realise a data fly-wheel, where the
latest user conversations can be used to alternately fine-tune the language
model and the reward model.
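A minimal sketch of the inference-time idea described above: sample several candidate responses, score each with a reward model, and keep only the top-scoring one. The names `best_of_n`, `mean_conversation_length`, and `toy_reward` are hypothetical, the toy scoring rule merely stands in for a trained reward model, and counting messages per conversation is only one possible reading of MCL.

```python
from statistics import mean
from typing import Callable, Sequence


def best_of_n(candidates: Sequence[str], reward: Callable[[str], float]) -> str:
    """Return the highest-scoring candidate; all others are rejected."""
    if not candidates:
        raise ValueError("need at least one candidate response")
    return max(candidates, key=reward)


def mean_conversation_length(conversations: Sequence[Sequence[str]]) -> float:
    """Average number of messages per conversation (assumed MCL definition)."""
    return mean(len(conversation) for conversation in conversations)


def toy_reward(response: str) -> float:
    """Hypothetical stand-in for a reward model: favours longer replies and questions."""
    return len(response.split()) + (5.0 if response.endswith("?") else 0.0)


samples = ["ok", "That sounds fun, what happened next?", "I see."]
chosen = best_of_n(samples, toy_reward)
mcl = mean_conversation_length([["hi", "hello"], ["a", "b", "c", "d"]])
```

In a deployed system the reward function would be a model trained on pseudo-labels from user interactions; here it is replaced by a hand-written heuristic purely for illustration.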
☆ Susceptibility to Influence of Large Language Models
Lewis D Griffin, Bennett Kleinberg, Maximilian Mozes, Kimberly T Mai, Maria Vau, Matthew Caldwell, Augustine Marvor-Parker
Two studies tested the hypothesis that a Large Language Model (LLM) can be
used to model psychological change following exposure to influential input. The
first study tested a generic mode of influence - the Illusory Truth Effect
(ITE) - where earlier exposure to a statement (through, for example, rating its
interest) boosts a later truthfulness test rating. Data was collected from 1000
human participants using an online experiment, and 1000 simulated participants
using engineered prompts and LLM completion. 64 ratings per participant were
collected, using all exposure-test combinations of the attributes: truth,
interest, sentiment and importance. The results for human participants
reconfirmed the ITE, and demonstrated an absence of effect for attributes other
than truth, and when the same attribute is used for exposure and test. The same
pattern of effects was found for LLM-simulated participants. The second study
concerns a specific mode of influence - populist framing of news to increase
its persuasion and political mobilization. Data from LLM-simulated participants
was collected and compared to previously published data from a 15-country
experiment on 7286 human participants. Several effects previously demonstrated
from the human study were replicated by the simulated study, including effects
that surprised the authors of the human study by contradicting their
theoretical expectations (anti-immigrant framing of news decreases its
persuasion and mobilization); but some significant relationships found in human
data (modulation of the effectiveness of populist framing according to relative
deprivation of the participant) were not present in the LLM data. Together the
two studies support the view that LLMs have potential to act as models of the
effect of influence.
comment: 24 pages, 6 figures, 7 tables, 53 references
☆ Is In-hospital Meta-information Useful for Abstractive Discharge Summary Generation?
During the patient's hospitalization, the physician must record daily
observations of the patient and summarize them into a brief document called
"discharge summary" when the patient is discharged. Automated generation of
discharge summary can greatly relieve the physicians' burden, and has been
addressed recently in the research community. Most previous studies of
discharge summary generation using the sequence-to-sequence architecture focus
on only inpatient notes for input. However, electronic health records (EHRs) also
have rich structured metadata (e.g., hospital, physician, disease, length of
stay, etc.) that might be useful. This paper investigates the effectiveness of
medical meta-information for summarization tasks. We obtain four types of
meta-information from the EHR systems and encode each meta-information into a
sequence-to-sequence model. Using Japanese EHRs, meta-information encoded
models increased ROUGE-1 by up to 4.45 points and BERTScore by 3.77 points over
the vanilla Longformer. Also, we found that the encoded meta-information
improves the precision of its related terms in the outputs. Our results showed
the benefit of the use of medical meta-information.
☆ Robust Knowledge Distillation from RNN-T Models With Noisy Training Labels Using Full-Sum Loss ICASSP 2023
This work studies knowledge distillation (KD) and addresses its constraints
for recurrent neural network transducer (RNN-T) models. In hard distillation, a
teacher model transcribes large amounts of unlabelled speech to train a student
model. Soft distillation is another popular KD method that distills the output
logits of the teacher model. Due to the nature of RNN-T alignments, applying
soft distillation between RNN-T architectures having different posterior
distributions is challenging. In addition, bad teachers having high
word-error-rate (WER) reduce the efficacy of KD. We investigate how to
effectively distill knowledge from variable quality ASR teachers, which has not
been studied before to the best of our knowledge. We show that a sequence-level
KD, full-sum distillation, outperforms other distillation methods for RNN-T
models, especially for bad teachers. We also propose a variant of full-sum
distillation that distills the sequence discriminative knowledge of the teacher
leading to further improvement in WER. We conduct experiments on public
datasets namely SpeechStew and LibriSpeech, and on in-house production data.
comment: Accepted at ICASSP 2023
☆ Creation and evaluation of timelines for longitudinal user posts EACL 2023
There is increasing interest in working with user-generated content in social
media, especially textual posts over time. Currently there is no consistent way
of segmenting user posts into timelines in a meaningful way that improves the
quality and cost of manual annotation. Here we propose a set of methods for
segmenting longitudinal user posts into timelines likely to contain interesting
moments of change in a user's behaviour, based on their online posting
activity. We also propose a novel framework for evaluating timelines and show
its applicability in the context of two different social media datasets.
Finally, we present a discussion of the linguistic content of highly ranked
timelines.
comment: Accepted at EACL 2023 (main, long); camera-ready version
☆ An algebraic approach to translating Japanese
We use Lambek's pregroups and the framework of compositional distributional
models of language ("DisCoCat") to study translations from Japanese to English
as pairs of functors. Adding decorations to pregroups we show how to handle
word order changes between languages.
comment: 20 pages, multiple diagrams and glosses
☆ An Overview on Language Models: Recent Developments and Outlook
Language modeling studies the probability distributions over strings of
texts. It is one of the most fundamental tasks in natural language processing
(NLP). It has been widely used in text generation, speech recognition, machine
translation, etc. Conventional language models (CLMs) aim to predict the
probability of linguistic sequences in a causal manner. In contrast,
pre-trained language models (PLMs) cover broader concepts and can be used in
both causal sequential modeling and fine-tuning for downstream applications.
PLMs have their own training paradigms (usually self-supervised) and serve as
foundation models in modern NLP systems. This overview paper provides an
introduction to both CLMs and PLMs from five aspects, i.e., linguistic units,
structures, training methods, evaluation methods, and applications.
Furthermore, we discuss the relationship between CLMs and PLMs and shed light
on the future directions of language modeling in the pre-trained era.
☆ MIXPGD: Hybrid Adversarial Training for Speech Recognition Systems
Automatic speech recognition (ASR) systems based on deep neural networks are
weak against adversarial perturbations. We propose the mixPGD adversarial training
method to improve the robustness of the model for ASR systems. In standard
adversarial training, adversarial samples are generated by leveraging
supervised or unsupervised methods. We merge the capabilities of both
supervised and unsupervised approaches in our method to generate new
adversarial samples which aid in improving model robustness. Extensive
experiments and comparison across various state-of-the-art defense methods and
adversarial attacks have been performed to show that mixPGD achieves a 4.1%
better WER than the previous best-performing models under the white-box
adversarial attack setting. We tested our proposed defense method against both
white-box and transfer based black-box attack settings to ensure that our
defense strategy is robust against various types of attacks. Empirical results
on several adversarial attacks validate the effectiveness of our proposed
approach.
☆ Clinical BERTScore: An Improved Measure of Automatic Speech Recognition Performance in Clinical Settings
Automatic Speech Recognition (ASR) in medical contexts has the potential to
save time, cut costs, increase report accuracy, and reduce physician burnout.
However, the healthcare industry has been slower to adopt this technology, in
part due to the importance of avoiding medically-relevant transcription
mistakes. In this work, we present the Clinical BERTScore (CBERTScore), an ASR
metric that penalizes clinically-relevant mistakes more than others. We
demonstrate that this metric more closely aligns with clinician preferences on
medical sentences as compared to other metrics (WER, BLEU, METEOR, etc.),
sometimes by wide margins. We collect a benchmark of 13 clinician preferences
on 149 realistic medical sentences called the Clinician Transcript Preference
benchmark (CTP), demonstrate that CBERTScore more closely matches what
clinicians prefer, and release the benchmark for the community to further
develop clinically-aware ASR metrics.
☆ MuLTI: Efficient Video-and-Language Understanding with MultiWay-Sampler and Multiple Choice Modeling
Video-and-language understanding has a variety of applications in the
industry, such as video question answering, text-video retrieval and
multi-label classification. Existing video-and-language understanding methods
generally adopt heavy multi-modal encoders and feature fusion modules, which
consume large amounts of GPU memory. In particular, they have difficulty dealing
with dense video frames or long text that are prevalent in industrial
applications. In this paper, we propose MuLTI, a highly accurate and
memory-efficient video-and-language understanding model that achieves efficient
and effective feature fusion through feature sampling and attention modules.
Therefore, MuLTI can handle longer sequences with limited GPU memory. Then, we
introduce an attention-based adapter to the encoders, which finetunes the
shallow features to improve the model's performance with low GPU memory
consumption. Finally, to further improve the model's performance, we introduce
a new pretraining task named Multiple Choice Modeling to bridge the task gap
between pretraining and downstream tasks and enhance the model's ability to
align the video and the text. Benefiting from the efficient feature fusion
module, the attention-based adapter and the new pretraining task, MuLTI
achieves state-of-the-art performance on multiple datasets. Implementation and
pretrained models will be released.
☆ Logic Against Bias: Textual Entailment Mitigates Stereotypical Sentence Reasoning EACL 2023
Due to their similarity-based learning objectives, pretrained sentence
encoders often internalize stereotypical assumptions that reflect the social
biases that exist within their training corpora. In this paper, we describe
several kinds of stereotypes concerning different communities that are present
in popular sentence representation models, including pretrained next sentence
prediction and contrastive sentence representation models. We compare such
models to textual entailment models that learn language logic for a variety of
downstream language understanding tasks. By comparing strong pretrained models
based on text similarity with textual entailment learning, we conclude that the
explicit logic learning with textual entailment can significantly reduce bias
and improve the recognition of social communities, without an explicit
de-biasing process.
comment: Accepted by EACL 2023
♻ ☆ Arabic aspect sentiment polarity classification using BERT
Aspect-based sentiment analysis (ABSA) is a textual analysis methodology that
defines the polarity of opinions on certain aspects related to specific
targets. The majority of research on ABSA is in English, with a small amount of
work available in Arabic. Most previous Arabic research has relied on deep
learning models that depend primarily on context-independent word embeddings
(e.g., word2vec), where each word has a fixed representation independent of its
context. This article explores the modeling capabilities of contextual
embeddings from pre-trained language models, such as BERT, and the use of
sentence-pair input for the Arabic aspect sentiment polarity classification
task. In particular, we develop a simple but effective BERT-based neural
baseline to handle this task. Our BERT architecture with a simple linear
classification layer surpassed state-of-the-art work, according to the
experimental results on three different Arabic datasets, achieving an accuracy
of 89.51% on the Arabic hotel reviews dataset, 73% on the Human annotated book
reviews dataset, and 85.73% on the Arabic news dataset.
♻ ☆ GPT-3-driven pedagogical agents for training children's curious question-asking skills
Rania Abdelghani, Yen-Hsiang Wang, Xingdi Yuan, Tong Wang, Pauline Lucas, Hélène Sauzéon, Pierre-Yves Oudeyer
In order to train children's ability to ask curiosity-driven questions,
previous research has explored designing specific exercises relying on
providing semantic and linguistic cues to help formulate such questions. But
despite showing pedagogical efficiency, this method is still limited as it
relies on generating the said cues by hand, which can be a very costly process.
In this context, we propose to leverage advances in the natural language
processing field (NLP) and investigate the efficiency of using a large language
model (LLM) for automating the production of the pedagogical content of a
curious question-asking (QA) training. We study generating the said content
using the "prompt-based" method that consists of explaining the task to the LLM
in natural text. We evaluate the output using human expert annotations and
comparisons with hand-generated content. Results indeed suggested the relevance
and usefulness of this content. We also conduct a field study in primary school
(75 children aged 9-10), where we evaluate children's QA performance when
having this training. We compare 3 types of content: 1) hand-generated content
that proposes "closed" cues leading to predefined questions; 2) GPT-3-generated
content that proposes the same type of cues; 3) GPT-3-generated content that
proposes "open" cues leading to several possible questions. We see a similar QA
performance between the two "closed" trainings (showing the scalability of the
approach using GPT-3), and a better one for participants with the "open"
training. These results suggest the efficiency of using LLMs to support
children in generating more curious questions, using a natural language
prompting approach that affords usability by teachers and other users who are
not specialists in AI techniques. Furthermore, results also show that open-ended
content may be more suitable for training curious question-asking skills.
♻ ☆ Large Language Models Are Human-Level Prompt Engineers
By conditioning on natural language instructions, large language models
(LLMs) have displayed impressive capabilities as general-purpose computers.
However, task performance depends significantly on the quality of the prompt
used to steer the model, and most effective prompts have been handcrafted by
humans. Inspired by classical program synthesis and the human approach to
prompt engineering, we propose Automatic Prompt Engineer (APE) for automatic
instruction generation and selection. In our method, we treat the instruction
as the "program," optimized by searching over a pool of instruction candidates
proposed by an LLM in order to maximize a chosen score function. To evaluate
the quality of the selected instruction, we evaluate the zero-shot performance
of another LLM following the selected instruction. Experiments on 24 NLP tasks
show that our automatically generated instructions outperform the prior LLM
baseline by a large margin and achieve better or comparable performance to the
instructions generated by human annotators on 19/24 tasks. We conduct extensive
qualitative and quantitative analyses to explore the performance of APE. We
show that APE-engineered prompts can be applied to steer models toward
truthfulness and/or informativeness, as well as to improve few-shot learning
performance by simply prepending them to standard in-context learning prompts.
Please check out our webpage at
https://sites.google.com/view/automatic-prompt-engineer.
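The selection step described above (search a pool of instruction candidates for the one that maximizes a chosen score function) can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: `select_instruction` and `toy_model` are hypothetical names, the candidate pool is hand-written rather than proposed by an LLM, and exact-match accuracy on a small development set is just one possible score function.

```python
from typing import Callable, Sequence, Tuple

Example = Tuple[str, str]  # (input text, expected output)


def select_instruction(
    candidates: Sequence[str],
    run_model: Callable[[str, str], str],
    dev_set: Sequence[Example],
) -> Tuple[str, float]:
    """Return the candidate instruction with the best exact-match accuracy."""

    def score(instruction: str) -> float:
        hits = sum(run_model(instruction, x) == y for x, y in dev_set)
        return hits / len(dev_set)

    best = max(candidates, key=score)
    return best, score(best)


def toy_model(instruction: str, text: str) -> str:
    """Hypothetical stand-in for an LLM that follows the instruction."""
    if "uppercase" in instruction:
        return text.upper()
    if "reverse" in instruction:
        return text[::-1]
    return text


dev = [("abc", "ABC"), ("hi", "HI")]
pool = ["Repeat the input.", "Write the input in uppercase.", "reverse the input."]
best_instruction, best_score = select_instruction(pool, toy_model, dev)
```

Swapping `toy_model` for a real model call and the hand-written pool for model-proposed candidates would leave the selection logic unchanged.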
♻ ☆ A Kind Introduction to Lexical and Grammatical Aspect, with a Survey of Computational Approaches EACL 2023
Aspectual meaning refers to how the internal temporal structure of situations
is presented. This includes whether a situation is described as a state or as
an event, whether the situation is finished or ongoing, and whether it is
viewed as a whole or with a focus on a particular phase. This survey gives an
overview of computational approaches to modeling lexical and grammatical aspect
along with intuitive explanations of the necessary linguistic concepts and
terminology. In particular, we describe the concepts of stativity, telicity,
habituality, perfective and imperfective, as well as influential inventories of
eventuality and situation types. We argue that because aspect is a crucial
component of semantics, especially when it comes to reporting the temporal
structure of situations in a precise way, future NLP approaches need to be able
to handle and evaluate it systematically in order to achieve human-level
language understanding.
comment: Accepted at EACL 2023, camera ready version
♻ ☆ Self-Adaptive Named Entity Recognition by Retrieving Unstructured Knowledge EACL 2023
Although named entity recognition (NER) helps us to extract domain-specific
entities from text (e.g., artists in the music domain), it is costly to create
a large amount of training data or a structured knowledge base to perform
accurate NER in the target domain. Here, we propose self-adaptive NER, which
retrieves external knowledge from unstructured text to learn the usages of
entities that have not been learned well. To retrieve useful knowledge for NER,
we design an effective two-stage model that retrieves unstructured knowledge
using uncertain entities as queries. Our model predicts the entities in the
input and then finds those for which the prediction is not confident. Then, it
retrieves knowledge by using these uncertain entities as queries and
concatenates the retrieved text to the original input to revise the prediction.
Experiments on CrossNER datasets demonstrated that our model outperforms strong
baselines by 2.35 points in F1 metric.
comment: EACL2023 (long)
♻ ☆ Topic Modelling of Swedish Newspaper Articles about Coronavirus: a Case Study using Latent Dirichlet Allocation Method
Topic Modelling (TM) is a research branch of natural language
understanding (NLU) and natural language processing (NLP) that facilitates
insightful analysis of large documents and datasets, such as summarisation
of the main topics and of topic changes. This kind of discovery is getting more
popular in real-life applications due to its impact on big data analytics. In
this study, from the social-media and healthcare domain, we apply popular
Latent Dirichlet Allocation (LDA) methods to model the topic changes in Swedish
newspaper articles about Coronavirus. We describe the corpus we created,
comprising 6515 articles, the methods applied, and statistics on topic changes
over a period of approximately one year and two months, from 17th January 2020 to
13th March 2021. We hope this work can be an asset for grounding applications
of topic modelling and can be inspiring for similar case studies in an era with
pandemics, to support socio-economic impact research as well as clinical and
healthcare analytics. Our data and source code are openly available at
https://github.com/poethan/Swed_Covid_TM
Keywords: Latent Dirichlet Allocation (LDA); Topic Modelling; Coronavirus;
Pandemics; Natural Language Understanding; BERT-topic
comment: 14 pages, 14 figures
♻ ☆ Fillers in Spoken Language Understanding: Computational and Psycholinguistic Perspectives
Disfluencies (i.e., interruptions in the regular flow of speech) are
ubiquitous in spoken discourse. Fillers ("uh", "um") are the disfluencies that
occur most frequently. Yet, to the
best of our knowledge, there isn't a resource that brings together the research
perspectives influencing Spoken Language Understanding (SLU) on these speech
events. The aim of this article is to survey a breadth of perspectives in a
holistic way; i.e. from considering underlying (psycho)linguistic theory, to
their annotation and consideration in Automatic Speech Recognition (ASR) and
SLU systems, to lastly, their study from a generation standpoint. This article
aims to present the perspectives in an approachable way to the SLU and
Conversational AI community, and to discuss what we believe are the trends and
challenges in each area moving forward.
comment: To appear in TAL Journal
♻ ☆ Word-Graph2vec: An efficient word embedding approach on word co-occurrence graph using random walk sampling
Word embedding has become ubiquitous and is widely used in various text
mining and natural language processing (NLP) tasks, such as information
retrieval, semantic analysis, and machine translation, among many others.
Unfortunately, it is prohibitively expensive to train the word embedding in a
relatively large corpus. We propose a graph-based word embedding algorithm,
called Word-Graph2vec, which converts the large corpus into a word
co-occurrence graph, then samples word sequences from this graph by random
walks, and finally trains the word embedding on the sampled corpus. We posit
that, because of the stable vocabulary, relative idioms, and fixed
expressions in English, the size and density of the word co-occurrence graph
change only slightly as the training corpus grows. As a result,
Word-Graph2vec has a stable runtime on large-scale data sets, and its
performance advantage becomes more and more obvious with the growth of the
training corpus. Extensive experiments conducted on real-world datasets show
that the proposed algorithm outperforms traditional Skip-Gram by four to five
times in terms of efficiency, while the error generated by the random walk
sampling is small.
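A rough sketch of the first two stages described above, building a word co-occurrence graph and sampling word sequences from it by weighted random walks. The function names, the window size, and the choice to weight transitions by co-occurrence count are assumptions made for illustration; the abstract does not specify them. The final stage, training Skip-Gram on the sampled walks, is omitted.

```python
import random
from collections import defaultdict
from typing import Dict, List

Graph = Dict[str, Dict[str, int]]


def build_cooccurrence_graph(sentences: List[List[str]], window: int = 2) -> Graph:
    """Undirected graph whose edge weights count co-occurrences within the window."""
    graph: Graph = defaultdict(lambda: defaultdict(int))
    for tokens in sentences:
        for i, word in enumerate(tokens):
            for j in range(i + 1, min(i + 1 + window, len(tokens))):
                graph[word][tokens[j]] += 1
                graph[tokens[j]][word] += 1
    return graph


def random_walk(graph: Graph, start: str, length: int, rng: random.Random) -> List[str]:
    """Sample a word sequence, moving to neighbours in proportion to edge weight."""
    walk = [start]
    while len(walk) < length:
        neighbours = graph.get(walk[-1], {})
        if not neighbours:
            break  # isolated word: the walk cannot continue
        words = list(neighbours)
        weights = [neighbours[w] for w in words]
        walk.append(rng.choices(words, weights=weights, k=1)[0])
    return walk


corpus = [["the", "cat", "sat"], ["the", "dog", "sat"]]
graph = build_cooccurrence_graph(corpus, window=1)
walk = random_walk(graph, "the", 5, random.Random(0))
```

Because the graph size depends on the vocabulary rather than on the raw token count, the cost of sampling walks need not grow with the corpus, which is the intuition behind the stable runtime claimed above.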
♻ ☆ BERT-Deep CNN: State-of-the-Art for Sentiment Analysis of COVID-19 Tweets
Javad Hassannataj Joloudari, Sadiq Hussain, Mohammad Ali Nematollahi, Rouhollah Bagheri, Fatemeh Fazl, Roohallah Alizadehsani, Reza Lashgari, Ashis Talukder
The free flow of information has been accelerated by the rapid development of
social media technology. There has been a significant social and psychological
impact on the population due to the outbreak of Coronavirus disease (COVID-19).
The COVID-19 pandemic is one of the current events being discussed on social
media platforms. In order to safeguard societies from this pandemic, studying
people's emotions on social media is crucial. As a result of their particular
characteristics, sentiment analysis of texts like tweets remains challenging.
Sentiment analysis is a powerful text analysis tool. It automatically detects
and analyzes opinions and emotions from unstructured data. Texts from a wide
range of sources are examined by a sentiment analysis tool, which extracts
meaning from them, including emails, surveys, reviews, social media posts, and
web articles. To evaluate sentiments, natural language processing (NLP) and
machine learning techniques are used, which assign weights to entities, topics,
themes, and categories in sentences or phrases. Machine learning tools learn
how to detect sentiment without human intervention by examining examples of
emotions in text. In a pandemic situation, analyzing social media texts to
uncover sentimental trends can be very helpful in gaining a better
understanding of society's needs and predicting future trends. We intend to
study society's perception of the COVID-19 pandemic through social media using
state-of-the-art BERT and Deep CNN models. The superiority of BERT models over
other deep models in sentiment analysis is evident and can be concluded from
the comparison of the various research studies mentioned in this article.
comment: 20 pages, 5 figures
♻ ☆ Temporal Modeling Matters: A Novel Temporal Emotional Modeling Approach for Speech Emotion Recognition ICASSP 2023
Speech emotion recognition (SER) plays a vital role in improving the
interactions between humans and machines by inferring human emotion and
affective states from speech signals. Whereas recent works primarily focus on
mining spatiotemporal information from hand-crafted features, we explore how to
model the temporal patterns of speech emotions from dynamic temporal scales.
Towards that goal, we introduce a novel temporal emotional modeling approach
for SER, termed Temporal-aware bI-direction Multi-scale Network (TIM-Net),
which learns multi-scale contextual affective representations from various time
scales. Specifically, TIM-Net first employs temporal-aware blocks to learn
temporal affective representation, then integrates complementary information
from the past and the future to enrich contextual representations, and finally,
fuses multiple time scale features for better adaptation to the emotional
variation. Extensive experimental results on six benchmark SER datasets
demonstrate the superior performance of TIM-Net, gaining 2.34% and 2.61%
improvements of the average UAR and WAR over the second-best on each corpus.
The source code is available at https://github.com/Jiaxin-Ye/TIM-Net_SER.
comment: Accepted by ICASSP 2023
♻ ☆ CoSyn: Detecting Implicit Hate Speech in Online Conversations Using a Context Synergized Hyperbolic Network IJCAI 2023
The tremendous growth of social media users interacting in online
conversations has also led to significant growth in hate speech. Most of the
prior works focus on detecting explicit hate speech, which is overt and
leverages hateful phrases, with very little work focusing on detecting hate
speech that is implicit or denotes hatred through indirect or coded language.
In this paper, we present CoSyn, a user- and conversational-context synergized
network for detecting implicit hate speech in online conversation trees. CoSyn
first models the user's personal historical and social context using a novel
hyperbolic Fourier attention mechanism and hyperbolic graph convolution
network. Next, we jointly model the user's personal context and the
conversational context using a novel context interaction mechanism in the
hyperbolic space that clearly captures the interplay between the two and makes
independent assessments on the amounts of information to be retrieved from both
contexts. CoSyn performs all operations in the hyperbolic space to account for
the scale-free dynamics of social media. We demonstrate the effectiveness of
CoSyn both qualitatively and quantitatively on an open-source hate speech
dataset with Twitter conversations and show that CoSyn outperforms all our
baselines in detecting implicit hate speech with absolute improvements in the
range of 8.15% - 19.50%.
comment: Under review at IJCAI 2023
♻ ☆ Self-Attention Networks Can Process Bounded Hierarchical Languages ACL 2021
Despite their impressive performance in NLP, self-attention networks were
recently proved to be limited for processing formal languages with hierarchical
structure, such as $\mathsf{Dyck}_k$, the language consisting of well-nested
parentheses of $k$ types. This suggested that natural language can be
approximated well with models that are too weak for formal languages, or that
the role of hierarchy and recursion in natural language might be limited. We
qualify this implication by proving that self-attention networks can process
$\mathsf{Dyck}_{k, D}$, the subset of $\mathsf{Dyck}_{k}$ with depth bounded by
$D$, which arguably better captures the bounded hierarchical structure of
natural language. Specifically, we construct a hard-attention network with
$D+1$ layers and $O(\log k)$ memory size (per token per layer) that recognizes
$\mathsf{Dyck}_{k, D}$, and a soft-attention network with two layers and
$O(\log k)$ memory size that generates $\mathsf{Dyck}_{k, D}$. Experiments show
that self-attention networks trained on $\mathsf{Dyck}_{k, D}$ generalize to
longer inputs with near-perfect accuracy, and also verify the theoretical
memory advantage of self-attention networks over recurrent networks.
comment: ACL 2021. 19 pages with extended appendix. v2 fixed a small typo in
the formula at the end of page 5 (thanks to Gabriel Faria). Code:
https://github.com/princeton-nlp/dyck-transformer
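To make the language $\mathsf{Dyck}_{k, D}$ concrete, here is a small membership check: well-nested brackets of $k$ types whose nesting depth never exceeds $D$. It uses an ordinary stack, not the self-attention constructions from the paper, and the function name and the string encoding of brackets are assumptions for illustration.

```python
from typing import Sequence


def is_bounded_dyck(s: str, pairs: Sequence[str], max_depth: int) -> bool:
    """Check membership in Dyck_{k, D}, with k = len(pairs) and D = max_depth."""
    closer_for = {pair[0]: pair[1] for pair in pairs}
    closers = set(closer_for.values())
    stack = []  # closing brackets still expected, innermost last
    for ch in s:
        if ch in closer_for:
            stack.append(closer_for[ch])
            if len(stack) > max_depth:
                return False  # nesting deeper than D
        elif ch in closers:
            if not stack or stack.pop() != ch:
                return False  # unmatched or mismatched closer
        else:
            return False  # symbol outside the bracket alphabet
    return not stack  # every opener must be closed


two_types = ["()", "[]"]
within_bound = is_bounded_dyck("([])", two_types, max_depth=2)
too_deep = is_bounded_dyck("([])", two_types, max_depth=1)
```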
♻ ☆ ReAct: Synergizing Reasoning and Acting in Language Models ICLR
While large language models (LLMs) have demonstrated impressive capabilities
across tasks in language understanding and interactive decision making, their
abilities for reasoning (e.g. chain-of-thought prompting) and acting (e.g.
action plan generation) have primarily been studied as separate topics. In this
paper, we explore the use of LLMs to generate both reasoning traces and
task-specific actions in an interleaved manner, allowing for greater synergy
between the two: reasoning traces help the model induce, track, and update
action plans as well as handle exceptions, while actions allow it to interface
with external sources, such as knowledge bases or environments, to gather
additional information. We apply our approach, named ReAct, to a diverse set of
language and decision making tasks and demonstrate its effectiveness over
state-of-the-art baselines, as well as improved human interpretability and
trustworthiness over methods without reasoning or acting components.
Concretely, on question answering (HotpotQA) and fact verification (Fever),
ReAct overcomes issues of hallucination and error propagation prevalent in
chain-of-thought reasoning by interacting with a simple Wikipedia API, and
generates human-like task-solving trajectories that are more interpretable than
baselines without reasoning traces. On two interactive decision making
benchmarks (ALFWorld and WebShop), ReAct outperforms imitation and
reinforcement learning methods by an absolute success rate of 34% and 10%
respectively, while being prompted with only one or two in-context examples.
Project site with code: https://react-lm.github.io
comment: v3 is the ICLR camera ready version with some typos fixed. Project
site with code: https://react-lm.github.io
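The interleaving of reasoning traces and actions described above can be sketched as a simple loop: the model emits a thought and an action, the environment returns an observation, and the growing trace is fed back in. Everything named here (`react_loop`, `toy_policy`, `toy_lookup`, the `finish` action, the `FACTS` table) is a hypothetical stand-in; in the paper the policy is a prompted LLM and the tool is a Wikipedia API.

```python
from typing import Callable, Dict, List, Tuple

Step = Tuple[str, str, str]  # (thought, action name, action argument)


def react_loop(
    question: str,
    policy: Callable[[str], Step],
    tools: Dict[str, Callable[[str], str]],
    max_steps: int = 5,
) -> Tuple[str, List[str]]:
    """Alternate thought/action/observation until the policy emits 'finish'."""
    trace = [f"Question: {question}"]
    for _ in range(max_steps):
        thought, action, argument = policy("\n".join(trace))
        trace.append(f"Thought: {thought}")
        trace.append(f"Action: {action}[{argument}]")
        if action == "finish":
            return argument, trace
        tool = tools.get(action)
        observation = tool(argument) if tool else f"unknown action {action}"
        trace.append(f"Observation: {observation}")
    return "", trace  # step budget exhausted without an answer


FACTS = {"ReAct": "ReAct interleaves reasoning traces and actions."}


def toy_lookup(entity: str) -> str:
    """Hypothetical stand-in for an external knowledge source."""
    return FACTS.get(entity, "no entry found")


def toy_policy(context: str) -> Step:
    """Scripted stand-in for an LLM: look something up once, then answer."""
    if "Observation:" not in context:
        return ("I should look up ReAct.", "lookup", "ReAct")
    last_observation = context.rsplit("Observation: ", 1)[1]
    return ("The observation answers the question.", "finish", last_observation)


answer, trace = react_loop("What does ReAct do?", toy_policy, {"lookup": toy_lookup})
```

Keeping the full trace as plain text is what lets each new thought condition on earlier observations, and it also gives a human-readable record of how the answer was reached.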